ABSTRACT


Recent advances in high-throughput sequencing 
technologies have revolutionized the field of 
population genetics. Data sets now routinely contain
genome-wide polymorphism information, and the
low cost of DNA sequencing enables researchers to
investigate tens of thousands of subjects at a time.
This provides an unprecedented opportunity to
address fundamental evolutionary questions, while
posing challenges to traditional population genetic
theories and methods. This review provides an
overview of the recent methodological developments 
in the field of population genetics, specifically 
methods used to infer ancient population history and 
investigate natural selection using large-sample, 
large-scale genetic data. Several open questions are 
also discussed at the end of the review.


INTRODUCTION


A central goal of evolutionary biology is to understand the 
mechanisms of how natural selection and other factors, such as 
random drift and mutation, drive the evolutionary process. 
Population geneticists address these questions quantitatively by 
building mathematical models, developing statistical methods 
for inferring parameters of ancestral processes, and testing 
hypotheses based on the analysis of real data. With the 
development of sequencing technologies in recent years, large- 
sample, large-scale genetic polymorphism data from humans 
and other species have increased greatly (Mardis, 2008), 
bringing valuable resources to address evolutionary questions. 
However, the computational demands of traditional population
genetic methods restrict their application to small samples
and/or local chromosome regions, and methodological development
in population genetics has responded to the emergence of these
large-sample, large-scale genetic data. This review provides a
summary of methodological advances that address two
fundamental problems in population genetics: inference of
ancient demography and detection of natural selection.


LEARNING ANCIENT DEMOGRAPHY 


In genetic studies of human evolution, mitochondrial DNA (mt- 
DNA) and the Y chromosome have been invaluable markers for 
reconstructing the history of modern humans (Cavalli-Sforza & 
Bodmer, 1971). In their seminal paper, Cann et al (1987) 
constructed a phylogenetic tree of 147 individuals from five 
human populations using mt-DNA sequences, and inferred the 
time of the common female ancestor to be about 200 Kyr ago.
Because mt-DNA is maternally inherited and therefore traces only
maternal lineages, Cavalli-Sforza and colleagues
sequenced the non-recombining region of the Y chromosome
(NRY) and developed a system for investigating paternal origins
(Underhill et al, 2001). mt-DNA and NRY are especially useful
for constructing population history, since the non-recombining 
regions of mt-DNA or NRY share a single gene genealogy, 
enabling the inference of a common phylogenetic tree. Using 
mt-DNA and NRY, population geneticists can determine the 
migration route and dispersal of global human populations. 
Many successful applications of this approach can be found in 
studies of East Asian populations (Jin & Su, 2000; Ke et al, 
2001; Kong et al, 2003; Yao et al, 2002; Zhang et al, 2013; 
Zhao et al, 2009). 

Most genetic polymorphisms, however, reside in autosomal
regions. Due to recombination along chromosomes, two distant
autosomal regions from a sample are likely to evolve with
independent genealogies. These gene genealogies represent
independent realizations of the evolutionary process given the
underlying demographic history, and thus are more informative
for demographic inference than mt-DNA or NRY, each of which
carries information from only a single gene genealogy. One approach to
exploring population structure using autosomal polymorphisms 
is principal component analysis (PCA). This approach 
decomposes the covariance matrix that summarizes the 
correlations among individuals or populations, and efficiently 
extracts the main structure of the data. It then presents the
population structure by plotting a few leading principal
components against each other. This method was introduced
to the field of population genetics by Cavalli-Sforza & Bodmer
(1971), and has proven very useful for investigating population
structure using abundant genomic polymorphism data from multiple
populations (Patterson et al, 2006).
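
The PCA procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Patterson et al (2006): the function name `genotype_pca`, the simple mean-centering without variance normalization, and the toy genotype matrix are all introduced here for illustration only.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: (individuals x SNPs) array of allele counts (0, 1 or 2).
    Returns (scores, explained): one row of scores per individual, and the
    fraction of total variance carried by each retained component.
    """
    g = np.asarray(genotypes, dtype=float)
    # Center each SNP column on its mean allele count.
    centered = g - g.mean(axis=0)
    # Individual-by-individual covariance matrix.
    cov = centered @ centered.T / g.shape[1]
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Scale each eigenvector by the square root of its eigenvalue.
    scores = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))
    explained = eigvals[order] / eigvals.sum()
    return scores, explained

# Toy data (hypothetical): 3 individuals from each of two populations, 6 SNPs.
toy = [
    [2, 2, 1, 0, 0, 0],
    [2, 1, 2, 0, 0, 1],
    [1, 2, 2, 0, 1, 0],
    [0, 0, 0, 2, 2, 1],
    [0, 1, 0, 2, 1, 2],
    [0, 0, 1, 1, 2, 2],
]
scores, explained = genotype_pca(toy)
```

Plotting the first column of `scores` against the second gives the kind of scatter plot used to display population structure; in this toy example the first component separates the two groups of three individuals.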

Beyond exploratory analysis of population structure using
PCA, recent population genetic studies have gained
further insight into demographic history from explicit population
genetic models. Methods for population history inference
have been developed using different aspects of genomic
polymorphism, mostly based on: (1) the allele frequency
spectrum (AFS, also called the "site frequency spectrum",
SFS), and (2) linkage disequilibrium (LD) or haplotype
structure (Table 1).


Allele frequency spectrum-based methods 

The AFS is a sampling distribution of alleles in a finite sample 
(Chen, 2012), and focuses on the allele frequency distribution 
of a single locus, ignoring the correlation among nearby loci. 
Such approximation greatly simplifies theory and methodology 
development. AFS theory was developed in two parallel 
frameworks: the diffusion (Kimura, 1955) and coalescent 
processes (Fu, 1995). 
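
To make the definition concrete, the following sketch tabulates the unfolded AFS from a small haplotype matrix and compares it with the well-known neutral, constant-size expectation, in which the expected number of sites with i derived copies is theta / i. The function names and the toy haplotypes are hypothetical, and the sketch assumes the ancestral state (coded 0) is known at every site.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded AFS as a list: element i-1 is the number of sites at which
    exactly i of the n sampled haplotypes carry the derived allele (coded 1).
    Sites with 0 or n derived copies are not segregating and are ignored."""
    n = len(haplotypes)
    counts = Counter(sum(site) for site in zip(*haplotypes))
    return [counts.get(i, 0) for i in range(1, n)]

def neutral_expected_sfs(n, theta):
    """Expected AFS under neutrality and constant size: E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n)]

# Toy sample (hypothetical): 4 haplotypes, 6 sites.
haps = [
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
observed = site_frequency_spectrum(haps)        # classes i = 1, 2, 3
expected = neutral_expected_sfs(4, theta=2.0)
```

Demographic inference methods based on the AFS replace the simple theta / i expectation with the spectrum expected under a candidate history and compare it with the observed counts.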

Theoretical studies on a single-population AFS started 
from stationary populations of constant size, and later were 
extended for populations with time-varying size (Griffiths & 
Tavaré, 1998), including exponentially growing populations 
(Evans et al, 2007; Polanski & Kimmel, 2003; Wooding & 
Rogers, 2002) and piecewise constant populations (see the 
N-epoch model describing population size change in Marth et 
al, 2004). The AFS for populations with time-varying size
(non-equilibrium populations) was applied to real data to infer
ancient Asian and European demography (Keinan et al, 2007).

Recently, the AFS of one population was extended to the joint 
allele frequency spectrum (JAFS) of multiple populations. 
Gutenkunst et al (2009) obtained the numerical solution of a 
three-dimensional diffusion equation, and applied the method to 
infer the joint demographic history of three world populations: 
West European (CEU), East Asian (HAN) and West African 
(YRI). Their software, ∂a∂i, has been extensively applied to
analyze genomic data. Because it uses a finite difference approach
to obtain the numerical solution, its computational
complexity is a function of population size and increases
exponentially with the number of populations. The computation
becomes intensive when the number of populations is large or the
population is under rapid growth. Lukić et al (2011) reduced
computational complexity by using spectral methods, and 
successfully applied their method to infer the history of four 
world populations (Lukić & Hey, 2012). 

The above two JAFS approaches rely on numerical solutions 
of diffusion equations. Others have used coalescent simulations 
to approximate the JAFS of two or several populations 
(Excoffier et al, 2013; Li & Stephan, 2006). Chen (2012) derived 
an analytical form of the JAFS for multiple populations using 
coalescent theory. The method incorporates various scenarios,
including time-varying population size, instantaneous migration
and the hitch-hiking effect. Compared to the diffusion-based
approach, the computational complexity of the coalescent-based
method is reduced to a function of sample size, which makes it
much more efficient.


Haplotype structure-based methods 

Another group of methods considers linkage disequilibrium, or
the correlation of gene genealogies of adjacent sites. For
example, the pairwise sequentially Markovian coalescent method
(PSMC, Li & Durbin, 2011) uses a hidden Markov model to
approximate the dependency of the coalescent times of two 
haplotypes between adjacent loci, and further infers the detailed 
ancient population size from coalescent times. The limitation of 
the PSMC method is that it can only analyze one diploid 
genome. Burgess & Yang (2008) inferred ancient population 
size by analyzing multiple sequences, with each sequence 
representing one population. They developed a Markov chain 
Monte Carlo approach (MCMCCoal) to sample gene 
genealogies. Gronau et al (2011) modified MCMCCoal to allow 
two sequences from each sampled population, and applied 
their G-PhoCS method to infer the joint demography of four 
world populations. The Coal-HMM method (Hobolth et al, 2007) 
can also analyze multiple genomes from several populations. 
Instead of sampling over gene genealogies as in MCMCCoal,
Coal-HMM treats the unobserved gene genealogy at each
genomic position as a latent state in a hidden
Markov model. The method in Mailund et al (2011) is similar to
the PSMC method, but can analyze two sequences from two 
populations. 

Extending the above methods to the analysis of a large or
even medium number of individual genomes is more
challenging. Lohse et al (2011) used a probability generating
function to infer the coalescent times of multiple individuals, and
from these the population demography. Harris & Nielsen (2013)
investigated the extent of shared IBD (identical by descent) 
tracts between each pair of haplotypes, and applied their 
method to a two-isolated-population model with migration. The 
diCal method in Sheehan et al (2013) was built on the 
sequential Markovian coalescent process and improved 
computation by proposing more efficient importance-sampling 
proposal distributions. Recently, Schiffels & Durbin (2014)
extended Li and Durbin's PSMC method to the multiple
sequentially Markovian coalescent (MSMC) method, which can
deal with multiple individual genomes from two populations. The
possible number of gene genealogies increases dramatically 
when multiple sequences are included, and computation again 
becomes intensive. MSMC tackles the problem by focusing 
only on some summary statistics of the genealogies, such as 
first coalescent time of any two sequences and total length of all 
singleton branches of the genealogy. 

Overall, the existing haplotype-based methods can analyze only a
small number of individual genomes, but are nevertheless quite
powerful for inferring ancient history. For example, the PSMC method works
well for learning population size between 20-200 Kyr (Li & 
Durbin, 2011). The remaining challenge is to efficiently 
approximate the sequential Markovian coalescent or the 
ancestral recombination graph for larger samples.


Demography of East Asian populations

The AFS and haplotype-based methods have been applied to 
genomic data to infer demographic histories of humans and 
other species. In human studies, the demographic history of
Western Europeans has been the main focus due to the abundance
of sequence data for European populations. Several studies
inferred East Asian demographic history using the HapMap or
the Thousand Genomes Project data (Gravel et al, 2011;
Gronau et al, 2011; Li & Durbin, 2011; Schaffner et al, 2005;
Schiffels & Durbin, 2014). The inferred parameters of these
studies are detailed in Table 2. These studies identified major 
events in East Asian history; for example, Keinan et al (2007) 
analyzed the AFS of the HapMap HCB samples and estimated 
a severe bottleneck in East Asian populations with an intensity
(defined as T/2N) of 0.123±0.015. While detailed knowledge on
Asian population history is still limited, understanding East 
Asian demographics at a finer level is useful for disease studies, 
and relies on the availability of genomic polymorphism data. 





Table 2 Inferred demographic history of East Asians by different studies
Parameter               Schaffner (2005)  Gravel (2011)  Gronau (2011)  Schiffels (2014)
Current pop size        100 000           45 521         4 100          >1 000 000
Growth rate             -                 0.48%          -              -
Eurasian pop size       7 700             1 861          1 000          ~1 200
Eurasian split time     50.0 Kyr          23.0 Kyr       38.0 Kyr       20-40 Kyr
African effective size  24 000            14 474         141 000        ~15 000 (at 110 Kyr)
Out-of-Africa time      87.5 Kyr          51.0 Kyr       49.0 Kyr       60-80 Kyr


DETECTING NATURAL SELECTION 


Methods 

Detecting the effects of natural selection and identifying the
selected loci in the genome is another active research area in
population genetics. Natural selection, especially positive
selection (i.e., selective sweeps), generates distinctive genetic
polymorphism patterns in contemporary populations. Statistical
approaches have been constructed using some informative 
aspects of these patterns. 


Allele frequency spectrum. When natural selection drives a
selected mutant to fixation in a population, the allele 
frequencies of SNPs in the vicinity of the selected mutant are 
also affected. Their frequencies can be increased, if the alleles 
are linked to the selected mutant, or decreased otherwise. This 
is known as the hitch-hiking effect. Smith & Haigh (1974) 
provided a deterministic equation to approximate this effect 
(Fay & Wu, 2000). Recent theoretical studies on the selective 
coalescent process derived more accurate sampling formulas 
for modeling the hitch-hiking effect (Durrett & Schweinsberg, 
2004; Etheridge et al, 2006). The sampling formulas can be 
used to derive the AFS of a linked SNP under hitch-hiking. 
Methods for detecting selection can then be constructed by 
combining information of multiple SNP loci with a composite- 
likelihood approach. As an approximation of the full likelihood, 
the composite likelihood of multiple loci is obtained by taking the 
product of marginal likelihood of individual loci. The AFS-based 
methods include SweepFinder for a single population (Nielsen 
et al, 2005), and the JAFS method of multiple populations 
(Chen, 2012). 
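
Written out, the composite likelihood described above takes the following form. The notation is introduced here for illustration and is not taken from the cited papers: L is the number of SNP loci, x_l the allele count data at locus l, and \Theta the parameter vector (for example, selection intensity and the position of the selected site).

```latex
\mathrm{CL}(\Theta) = \prod_{l=1}^{L} P(x_l \mid \Theta),
\qquad
\log \mathrm{CL}(\Theta) = \sum_{l=1}^{L} \log P(x_l \mid \Theta)
```

Each factor is the marginal likelihood of a single locus, so the product ignores the dependence among linked loci; this is what makes the expression an approximation to, rather than an instance of, the full likelihood.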

Summary statistics such as Tajima's D (Tajima, 1989), Fu
and Li's F* (Fu & Li, 1993) and Fay and Wu's H (Fay & Wu,
2000) are known as neutrality tests. These statistics
are summaries derived from a single-population AFS. For
example, a negative D value indicates a skewed AFS with
overrepresented singletons, while a positive D indicates an
enrichment of segregating sites with medium frequencies. A
likelihood approach using the full AFS has more power than
neutrality tests based on summary statistics.
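
As an illustration of how such a statistic summarizes the AFS, the sketch below computes Tajima's D directly from an unfolded spectrum using the standard constants from Tajima (1989). The function name and the two toy spectra are hypothetical; the input convention (element i-1 holds the number of sites with i derived copies) is an assumption of this example.

```python
import math

def tajimas_d(sfs):
    """Tajima's D from an unfolded AFS, where sfs[i-1] is the number of
    sites with i derived copies in a sample of n = len(sfs) + 1 sequences."""
    n = len(sfs) + 1
    s = sum(sfs)                      # number of segregating sites
    if s == 0:
        return float("nan")
    pairs = n * (n - 1) / 2
    # Mean number of pairwise differences, computed from the spectrum.
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference between two estimators of theta, scaled by its standard error.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

d_rare = tajimas_d([20, 0, 0, 0, 0, 0, 0, 0, 0])   # singletons only
d_mid = tajimas_d([0, 0, 0, 0, 20, 0, 0, 0, 0])    # intermediate frequencies
```

With ten sequences, the singleton-only spectrum yields a negative D and the intermediate-frequency spectrum a positive D, matching the interpretation given above.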

AFS methods are powerful for detecting selection, but have 
several limitations: (1) population history can confound the 
effect of selection on the AFS. For example, a recent rapid 
population growth increases the relative abundance of rare 
alleles, which is similar to the effect of negative selection. 
However, since demographic effects are genome-wide, it is 
possible to control for these effects by explicitly modeling 
demographic history in the methods (Li & Stephan, 2006; 
Williamson et al, 2005); (2) AFS methods are not robust to SNP 
ascertainment bias (Clark et al, 2005). SNP data generated 
from platforms designed under complex ascertainment 
schemes, such as SNP arrays, are unfit for AFS methods. With 
more ascertainment-free genomic data from NGS technology, 
AFS-based methods are expected to become more applicable. 


Haplotype structure. When a selected allele rises to
high frequency, the ancestral haplotypes carrying the selected
allele also increase in frequency. Because the selective
phase is short, these haplotypes are rarely broken up
by recombination, and thus long extended haplotypes in
the vicinity of the selected allele can be observed. Such a
haplotype structure can be used to test for recent positive 
selection when the selected allele is still segregating. Several 
methods were developed based on such haplotype structure, 
including the EHH test (Sabeti et al, 2002), iHS test (Voight et al, 
2006), and the hidden Markov model approach (Chen et al, 
2015). The idea was further extended to the comparison of
haplotype homozygosity between two populations (e.g., the
XP-EHH test, Sabeti et al, 2007; Tang et al, 2007). These
haplotype-based methods are robust to SNP ascertainment 
schemes, and are very useful for SNP array data designed 
under complicated ascertainment schemes. 
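
The extended haplotype homozygosity underlying these tests can be sketched as follows. This is a simplified illustration in the spirit of the EHH statistic of Sabeti et al (2002), assuming phased haplotypes coded 0/1 and measuring distance in site indices rather than genetic distance; the function name `ehh` and the toy haplotypes are hypothetical.

```python
from collections import Counter
from math import comb

def ehh(haplotypes, core, allele, upto):
    """Extended haplotype homozygosity for one core allele.

    Among haplotypes carrying `allele` at site index `core`, return the
    probability that two randomly drawn haplotypes are identical at every
    site from `core` through `upto` (inclusive, in either direction).
    """
    lo, hi = min(core, upto), max(core, upto)
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core] == allele]
    if len(carriers) < 2:
        return float("nan")
    groups = Counter(carriers)
    same = sum(comb(c, 2) for c in groups.values())
    return same / comb(len(carriers), 2)

# Toy haplotypes (hypothetical); site 0 is the core SNP.
haps = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
]
derived_decay = [ehh(haps, core=0, allele=1, upto=j) for j in range(5)]
ancestral_decay = [ehh(haps, core=0, allele=0, upto=j) for j in range(5)]
```

In this toy data the homozygosity around the derived core allele decays more slowly with distance than around the ancestral allele, which is the contrast that statistics such as iHS are designed to capture.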


Population differentiation. Detecting selection from population
differentiation rests on the expectation that, if a gene is under
local adaptation in one population, its allele frequency divergence
among populations should far exceed the genome-wide level.
The fixation index, Fst, which measures the divergence 
between two populations at a single locus, was adopted by 
Lewontin & Krakauer (1973) to detect selection. A moments- 
based estimator of Fst developed by Weir & Cockerham (1984) 
has been commonly used when sample sizes from two 
populations are unbalanced and has been applied to a genome 
scan for selected loci (Akey et al, 2002). 
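
A minimal sketch of an Fst-based outlier scan is given below. It deliberately uses the simple heterozygosity-based definition of Fst for two equally weighted populations, not the Weir & Cockerham (1984) estimator, and it applies no sample-size correction; the function names, the empirical-quantile cutoff and the input format (a pair of population allele frequencies per locus) are assumptions made for illustration.

```python
def fst_biallelic(p1, p2):
    """Fst for one biallelic locus and two equally weighted populations,
    computed from the population allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return float("nan")                                # monomorphic locus
    return (h_total - h_within) / h_total

def outlier_loci(freqs, quantile=0.99):
    """Indices of loci whose Fst is at or above the empirical quantile
    of the genome-wide Fst distribution."""
    values = [fst_biallelic(p1, p2) for p1, p2 in freqs]
    ranked = sorted(v for v in values if v == v)           # drop NaN values
    cutoff = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]
    return [i for i, v in enumerate(values) if v >= cutoff]
```

For example, `fst_biallelic(0.2, 0.8)` evaluates to approximately 0.36, whereas two populations with identical frequencies give 0. Ranking loci against the genome-wide distribution, rather than against a theoretical null, is one common way to account for the fact that demographic history affects all loci.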

Fst is calculated for a single locus and has a large variance. 
The random fluctuation of Fst values across SNP loci causes 
high false positive rates. Combining the incremental effects of 
selection on multiple loci helps reduce the false positive rates 
and increase power. The XP-CLR test by Chen et al (2010) was 
developed in light of this principle. It explicitly models the decay 
of population differentiation as a function of genetic distance 
between the neutral marker and selected mutant, and uses a 
composite likelihood scheme to combine the effects of multiple 
loci. Other methods that make use of population differentiation
include the locus-specific branch length (LSBL) test and its variants
(Shriver et al, 2004; Yi et al, 2010), though they are based on
single loci.

The above three classes of methods utilize different aspects 
of data patterns. Grossman et al (2010) attempted to combine 
multiple genetic signals to reduce the false positive rates in 
detecting selection (composite of multiple signals test, CMS). 


Natural selection in East Asians 

Modern humans faced environmental changes and infectious 
diseases when they migrated out of Africa and colonized other 
places in the world. Natural selection was very likely essential 
during this process. Specifically, East Asian environments were 
extremely divergent in terms of climatic factors, such as UV 
light, temperature and altitude, making Asian populations ideal 
for studying natural selection (Shi & Su, 2011). 

Shi et al (2009) hypothesized that the P53-MDM2 pathway 
may be important for the adaptation to low temperature when 
East Asians moved from the south towards high-latitude areas. 
Tibetans have lived on the Himalayan Plateau for tens of 
thousands of years. Recent genomic studies identified several 
genes conferring hypoxia adaptation, among which EPAS1 
shows the strongest signal of Tibetan-specific selection (Beall et 
al, 2010; Peng et al, 2011; Simonson et al, 2010; Wang et al, 
2011; Xu et al, 2011a; Yi et al, 2010). Xiang et al (2013) recently
identified one functional mutant in another gene, EGLN1.
Interestingly, EGLN1 and EPAS1 both belong to the hypoxia
pathway and interact directly.

EDAR shows an extremely strong signal of recent positive
selection in East Asians (Sabeti et al, 2007). Kamberov et al
(2013) used mice to show that a non-synonymous mutation
could cause phenotypic changes in the skin. The mechanism
underlying selection on EDAR remains unclear, though it was 
hypothesized to be due to the high humidity in Eastern Asia. 

Lifestyle changes, such as the transition from a hunter-gatherer to an
agricultural society, might also be important driving forces. One
gene related to lifestyle transition is ADH (Li et al, 2008; Peng
et al, 2010). The ADH1B*47His allele in East Asians shows
signals of selection and its geographic frequency distribution is 
consistent with cultural relic sites of rice domestication in China, 
indicating that the ADH gene might be related to the potential 
benefits of fermented beverages and food. 


Artificial selection in domestication 

Domestication and animal and plant breeding have been 
ongoing since the origin of agriculture around 10 000 years ago. 
Artificial selection shaped the traits of domesticated species to 
meet human demands during this process (Li & Zhang, 2009). 
Investigating selection on domestication traits helps identify the 
genetic architecture of these traits and improve breeding 
(Doebley et al, 2006). In recent years, such endeavors have 
been facilitated by genomic sequencing technology (Doebley et 
al, 2006; Fang et al, 2009; Huang et al, 2012; Hufford et al, 
2012; Lyu et al, 2013, 2014; Qi et al, 2013a; Xu et al, 2011b; 
Zhou et al, 2015). 

Most such studies identified a list of selected loci, and then
checked their overlap with domestication QTLs or with
meaningful GO categories and pathways. Xia et al (2009)
sequenced 40 domesticated and wild silkworms. They identified 
signals of selection at 354 candidate genes possibly important 
during domestication, some of which have enriched expression 
in the silk gland, midgut and testis. Wang et al (2013) studied 
natural selection in dog genomes, and identified genes related 
to starch digestion and metabolism, including nutrient transport 
and regulation of the digestion process. This reflects the
agricultural living conditions during the domestication history of
dogs.

Zhou et al (2015) analyzed genomic sequences of 62 wild 
soybeans, 130 landraces and 110 improved cultivars. They 
identified 121 domestication-selective sweeps and 109 
improvement-selective sweeps. Among these selected targets, 
some were related to morphological features, such as seed size 
and color, seed weight, stem determinacy, flower color, seed 
coat color and pubescence form. In addition to morphological 
traits, more than 90 sweep targets were located within known 
oil QTL regions. 

Results from these genomic studies shed light on the 
fundamental mechanisms of the artificial selection process. For 
example, Hufford et al (2012) analyzed genomic sequences of
75 wild, landrace and improved maize lines, and found evidence for
stronger selection during domestication than during improvement, and
that artificial selection was common in regulatory regions, which
was confirmed by transcriptome analysis.


Inferring selection intensity, allele age and fixation time 

In addition to identifying loci under selection, population
geneticists are also interested in further details of the
selection process, such as when selection began
and how strong it was. A detailed portrait of the
selection process provides hints for deciphering the mechanism
of selection, and helps validate anthropological hypotheses. For
example, in studies on Tibetan high-altitude adaptation, Peng et
al (2011) and Xiang et al (2013) inferred the allele ages of
EPAS1 and EGLN1 using haplotype structure. Although both
genes were under strong selection, the estimated times
differed: selection on EPAS1 started around 20 Kyr ago, whereas
selection on EGLN1 started only about 7 Kyr ago. Interestingly, the
two selection times are consistent with the two waves of 
population migrations to the Tibetan Plateau (Qi et al, 2013b; 
Zhao et al, 2009). Based on this, Xiang et al (2013) proposed a 
two-step hypothesis on the development of Tibetan adaptation 
to high altitude. 

For selected alleles still segregating in the population, e.g.,
EPAS1 and EGLN1, Chen & Slatkin (2013) proposed a method
for inferring selection intensity and allele age using haplotype 
structure. Their method relies on importance sampling 
algorithms to sample from the genealogical space and allele 
frequency trajectories, which requires very intensive 
computation. The method can only analyze a local region for a 
small number of individuals. Chen et al (2015) developed a 
hidden Markov model for investigating the haplotype structure 
around the selected mutant, and provided a simplified
population genetic model for inferring the parameters. The 
simplified model is much more efficient, and can be applied to 
genome-wide analysis for large samples. 

If selection occurred in the distant past, the selected allele may
already have been fixed in the population. The parameter of interest for
ancient selection is then the time since fixation. For example, the
genetic changes underlying the emergence of speech and 
language in modern humans, believed to be under strong 
selection, were inferred to be fixed during the last 200 Kyrs 
(Enard et al, 2002). Linnen et al (2009) studied cryptically 
colored deer mice living on the Nebraska Sand Hills and 
showed that their light coloration was caused by a cis-acting 
mutation closely linked to a single amino acid deletion in Agouti. 
The fixation time of the mutant was 8-10 Kyrs ago. 

To date, only a few methods for inferring fixation time of 
selection have been proposed. The Bayesian approach by 
Przeworski (2003) simulates samples under selection, and 
matches a list of summary statistics between simulated and real 
data with rejection sampling. The method integrates over all 
possible values of selection intensity, and provides the posterior 
distribution of fixation time. Linnen et al (2009) used a two-step 
scheme to infer fixation time, first assuming the fixation time 
was 0 and estimating the selection intensity using the AFS of all 
SNPs from the gene region (Kim & Stephan, 2002). Fixing 
selection intensity to the estimated value, they used the 
Bayesian approach of Przeworski (2003) to obtain the posterior 
distribution of fixation time. However, the above two methods 
fail to jointly estimate selection intensity and fixation time. Chen 
(2012) modeled the pattern of the allele frequency spectrum of 
SNPs linked to a selected mutant as a function of selection 
intensity and fixation time, and efficiently estimated both 
parameters. 
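
The simulate-and-reject logic used by the Bayesian approach above can be illustrated with a deliberately simple toy problem. The sketch below does not implement the selection model of Przeworski (2003): as an assumption made purely for illustration, it infers the scaled mutation rate theta of a neutral, constant-size population from a single summary statistic, the number of segregating sites S. All function names, the flat prior and the tolerance are hypothetical choices.

```python
import random

def simulate_segregating_sites(n, theta, rng):
    """Draw the number of segregating sites S for n sequences under the
    standard neutral coalescent with scaled mutation rate theta."""
    # Total tree length: k lineages persist for an Exp(k(k-1)/2) time.
    length = sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))
    # Poisson(theta/2 * length) mutations, drawn by counting unit-rate
    # exponential waiting times that fit inside the mean.
    mean, total, count = theta / 2 * length, rng.expovariate(1.0), 0
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, n, prior_max, tol, draws, seed=1):
    """Keep prior draws of theta whose simulated S is within tol of s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.0, prior_max)          # flat prior (assumed)
        if abs(simulate_segregating_sites(n, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted

posterior = abc_rejection(s_obs=20, n=10, prior_max=20.0, tol=2, draws=20000)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values of theta form an approximate posterior sample. In the selection setting the same scheme would draw selection intensity and fixation time from their priors, simulate data under a sweep model, and compare several summary statistics rather than one.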


CHALLENGES AND FUTURE STUDIES


With the rapid development of sequencing technology, large-
sample or even population-scale sequencing data have
become available. For example, Coventry et al (2010)
sequenced two genes (KCNJ11 and HHEX) for 10 422
European Americans and 3 293 African Americans. Nelson et al 
(2012) and Fu et al (2013) conducted exome sequencing for 
several thousand individuals. Large-sample genomic data
provide valuable resources for population genomic studies. 
However, the computational capacities of traditional population 
genetic methods only allow their application to local regions 
and/or small samples. Developing computationally efficient 
methods capable of analyzing large-sample, large-scale 
genomic data is necessary and challenging. Some 
computational issues, such as controlling for data quality, 
especially for sequencing data with low coverage, are important 
for population genetic inference, but beyond the scope of this 
review. Discussions on this topic can be found in the literature 
(e.g., Han et al, 2014; Jiang et al, 2009; Johnson & Slatkin, 
2006; Liu et al, 2010). 


Computational challenges 

One main obstacle preventing the application of existing population
genetic methods to large genomic data is the intensive
computation required. The likelihood function of most population genetic
methods has the gene genealogy as a nuisance variable. These
methods evaluate the likelihood function by adopting Markov
chain Monte Carlo (MCMC) or importance sampling (IS) to
integrate over the gene genealogy space. This integration is computationally
very intensive, and only works for a small sample of haplotypes
from a local chromosome region (Griffiths & Tavaré, 1994).
Such methods cannot be directly scaled up to large-sample
genomic data even with high performance computers.
Developing efficient computing algorithms is necessary.

Another issue is numerical instability. For example,
essential components in coalescent-based methods are the
distributions of coalescent times and of ancestral lineage numbers.
Both distributions are expressed as alternating hypergeometric
series. When the sample size is large (e.g., n>100),
the coefficient of each individual term of the series becomes so
large that it exceeds the capacity of double-precision variables
in any programming language. One scheme to avoid such
numerical overflow is to use a high-precision arithmetic library 
(HPAL) in programming. This significantly increases 
programming difficulty and computing time with only a limited 
improvement in performance. A more applicable solution is to 
replace the exact distributions with their asymptotic distributions 
(Chen & Chen, 2013; Griffiths, 1984). Asymptotic formulas are 
usually in simple analytical form and easy to calculate. 
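
The numerical problem can be made concrete with the standard alternating series for the probability that n sampled lineages have j ancestors at time t (in coalescent units) in a constant-size population. The sketch below evaluates that series twice: in ordinary double precision and with Python's built-in `decimal` module as a stand-in for a high-precision arithmetic library. It illustrates the instability and the high-precision workaround only, not the asymptotic approximations; the function names and the default of 200 digits are assumptions of this example.

```python
from decimal import Decimal, localcontext
from math import factorial, exp

def _coefficient(n, j, k):
    """Exact (numerator, denominator) of the coefficient multiplying
    exp(-k(k-1)t/2) in the series for P(j ancestors at time t | n sampled)."""
    rising_j = factorial(j + k - 2) // factorial(j - 1)   # j(j+1)...(j+k-2)
    falling_n = factorial(n) // factorial(n - k)          # n(n-1)...(n-k+1)
    rising_n = factorial(n + k - 1) // factorial(n - 1)   # n(n+1)...(n+k-1)
    num = (-1) ** (k - j) * (2 * k - 1) * rising_j * falling_n
    den = factorial(j) * factorial(k - j) * rising_n
    return num, den

def lineage_prob_float(n, j, t):
    """Double-precision evaluation. Large terms of opposite sign cancel,
    so accuracy degrades (or the division overflows) as n grows."""
    total = 0.0
    for k in range(j, n + 1):
        num, den = _coefficient(n, j, k)
        total += num / den * exp(-k * (k - 1) * t / 2)
    return total

def lineage_prob_decimal(n, j, t, digits=200):
    """The same series evaluated with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        t = Decimal(str(t))
        total = Decimal(0)
        for k in range(j, n + 1):
            num, den = _coefficient(n, j, k)
            total += Decimal(num) / Decimal(den) * (Decimal(-k * (k - 1)) * t / 2).exp()
        return total

# The probabilities over j = 1..n should sum to one.
exact = [lineage_prob_decimal(30, j, 0.05) for j in range(1, 31)]
```

Comparing `lineage_prob_float` with `lineage_prob_decimal` for increasing n and small t shows where double precision stops being reliable, at the cost of much slower arithmetic in the high-precision version.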


Methods for detecting soft sweeps and polygenic
selection

Other questions arise from a biological rather than a
computational perspective. One such issue is the revision of our views
on the general forms of natural selection. After more than ten
years of genome-wide studies on selective sweeps in humans,
only a few genes have been identified as being under strong selection
due to extreme environmental factors (EPAS1) or infectious
diseases (G6PD). This is in conflict with our traditional
understanding and urges us to reflect on and explore the actual
general form of adaptation in nature. Recently, population
geneticists hypothesized that other forms of selection, such as
soft sweeps and polygenic selection, are likely to be more
common in nature and go undetected by existing genomic
approaches (Pritchard et al, 2010; Wollstein & Stephan, 2015).

Most conventional methods for detecting selective sweeps 
were developed by assuming selection starts from a de novo 
mutation. Such a selective process is called a hard sweep. If
selection starts from a standing variant, which has been
segregating in the population under neutrality for a long time and
is already at considerable frequency, it is a soft sweep (Hermisson & Pennings, 2005).
Researchers hypothesized that soft sweeps are more prevalent 
than hard sweeps (Pritchard & Di Rienzo, 2010). The chance 
for a new advantageous mutant to occur is very small, and it is 
also very unlikely that the new advantageous mutant can 
survive the effect of random drift in the early stage of the 
selective process to finally reach high frequency in the 
population. Ohta & Kimura (1975) already noticed this in their 
seminal paper on the hitch-hiking effect: “It is likely that the new 
advantageous allele will be chosen, in response to 
environmental changes, from the pre-existing alleles rather than 
occurring by mutation”. 

Although soft sweeps may be more common, it is not trivial to
propose a powerful method for detecting them. The
genetic polymorphism pattern caused by soft sweeps is
indistinguishable from that under neutrality in many aspects,
including the allele frequency spectrum, reduction of genetic
diversity, and linkage disequilibrium. This explains why very few
methods for detecting soft sweeps have been proposed so far
(Przeworski et al, 2005; but see Garud et al, 2015).

To date, genomic studies on selection have focused on
single loci, for example, the locus underlying lactase persistence. Some traits,
such as skin pigmentation, are determined by several major 
genes and show an evolutionary mode similar to the single- 
gene cases. However, most traits are quantitative and 
determined by multiple genes with minor effects and complex 
interactions. “It seems likely to us that, as in traditional 
quantitative genetic models, many -- possibly even most -- 
adaptive events in natural populations occur by polygenic 
adaptation” (Pritchard & Di Rienzo, 2010). Such traits, when 
under natural selection, tend to evolve in a polygenic mode: one 
could expect that multiple functional loci shift their allele 
frequencies without being fixed when the population fitness is 
improved by natural selection (Hancock et al, 2010). Our 
understanding of polygenic selection is in the early stages, and 
as pointed out by Pritchard & Di Rienzo (2010), empirical study 
and theoretical modeling are both needed to understand the 
mechanism of polygenic selection.